ABSTRACT

To determine if tryout samples typically used for item selection contribute to test bias
against minority groups, item analyses were made of the California Achievement Tests using
seven sub-groups of the standardization sample: Northern White Suburban, Northern Black
Urban, Southern White Suburban, Southern Black Rural, Southern White Rural, Southwestern
Mexican Urban, and Southwestern Anglo-American Suburban. The best half of the items in each
test were selected for each group. Typically about 30 percent of the items in the upper half
of the distribution of item-test correlations for a group on a test did not meet this
criterion with another group. By this criterion minority groups were relatively similar, as
were the three suburban groups. The resulting unique item tests did not correlate well with
each other. Scores of minority groups were relatively better on the selected items. Thus,
standard item selection procedures produce tests best suited to groups like the majority of
the tryout sample and are therefore biased against other groups to some degree. This degree
varies. Ways to minimize this bias need to be developed. (Author/MS)
DESCRIPTION OF THE STUDY



Statement of the Problem

The standardized achievement tests used in schools are often said to be biased against,
and thus inappropriate for, children belonging to disadvantaged racial and ethnic
minorities. If this is so, then there are two possible sources of such bias. The first may
originate in the preconceptions and thought patterns of the test item writers. The second
may result from the customary item tryout and selection procedures used in test
construction. This second possible source of bias is the general topic investigated in this
study.

A number of problems occur when trying to consider bias in achievement tests because
the criteria of bias are not completely clear. When most recent writers (Cardall &
Coffman 1964; Potthoff 1966; Cleary & Hilton 1968; Messick & Anderson 1970;
Green 1971) speak of bias, they say something about tests which measure different
things when used with different groups. How might item tryout and selection procedures
produce such a result?

The typical procedure in building standardized achievement and aptitude tests has
remained essentially unchanged over many years (cf. Lord & Novick 1968, Chapter 15;
Ruch 192[?], Chapter 2). The first step is to develop a pool of items meeting various
specifications as to form and content. Next these items are given to a sample of
individuals, the step in question here. Various item statistics, such as point biserial
correlations (item vs. total score), are calculated and the "best" items are then chosen;
"best" is customarily characterized first and foremost by a high relationship of the item
to the total score. Other characteristics such as difficulty and the effectiveness of
distracters (in multiple-choice tests) are also considered. Most of these latter item
characteristics are related to the item-test correlations to some degree. Therefore, the
items which "discriminate" best (i.e., show the highest relationship to total score) are the
ones usually chosen. This in turn means that the characteristics, or attributes, of the
individuals in the tryout sample which are most responsible for differences in total score
determine which items tend to be chosen and determine, in effect, what the test measures
within the range of possibilities available in the item pool. That is, to deal with the items,
the individuals tested call upon certain qualities, attitudes, knowledge, or skills found in
widely varying degrees in their group. The items most sensitive to these attributes of the
tryout sample then get selected.
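The item statistic at the center of this procedure can be illustrated with a minimal sketch. It assumes items are scored 0/1 and that the total score includes the item itself; the function names and the tiny data set are hypothetical and are not taken from the study.

```python
from math import sqrt


def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item and the total test score."""
    n = len(item_scores)
    mean_item = sum(item_scores) / n
    mean_total = sum(total_scores) / n
    cov = sum((x - mean_item) * (t - mean_total)
              for x, t in zip(item_scores, total_scores))
    var_item = sum((x - mean_item) ** 2 for x in item_scores)
    var_total = sum((t - mean_total) ** 2 for t in total_scores)
    if var_item == 0 or var_total == 0:
        return 0.0  # no variance: the item cannot discriminate in this sample
    return cov / sqrt(var_item * var_total)


def rank_items(responses):
    """Rank items from highest to lowest item-test correlation.

    responses: one list of 0/1 item scores per examinee.
    Returns a list of (item_index, correlation) pairs.
    """
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    stats = [(j, point_biserial([row[j] for row in responses], totals))
             for j in range(n_items)]
    return sorted(stats, key=lambda pair: pair[1], reverse=True)


# Hypothetical tryout sample: four examinees, three items.
sample = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
ranking = rank_items(sample)
```

Because the ranking depends entirely on the responses of whoever is in the tryout sample, running the same routine on a different group can reorder the items, which is the possibility the study examines.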

Consequently, the possibility exists that the items selected are biased and discriminate
against groups unlike the modal group in the tryout sample. If some atypical group has
traits not prominent in the tryout sample and if these traits interact more strongly with
the items than do the attributes the group shares with the majority, then the tests will
measure the distinctive characteristics of this group rather than the trait or traits
measured in the more typical groups. Another possibility is that the atypical group is
uniformly low on the measured traits, but not on other equally relevant, but unmeasured,
attributes. In either of these cases, one could say the resulting test is biased. In the first
instance, it is biased because it measures different things for different groups
unbeknownst to the users; in the second instance, it measures only a portion of the
relevant behaviors but is taken to measure them all.



Given circumstances such as those just described, the use of "average" item tryout
samples will result in the selection of item sets unsuited to one or more of the various
racial, ethnic, and cultural minority groups in our schools. From this, it may follow that the
use of a single tryout group can never solve the problem; perhaps only the construction
of separate tests would do so, although this solution would have obvious drawbacks.
Another alternative might be to use the same test but different item weights for different
groups.

The need to consider such unattractive possibilities depends on how strongly the nature
of a tryout sample determines the outcome of item selection. It is customarily assumed
that the choice of people for item tryouts does not have much effect on item selection,
although "atypical" groups (such as disadvantaged children) are usually avoided. This
amounts to the assumption that the test items function much the same way with all kinds
of people. Some evidence for evaluating this assumption is presented in this report.

Related Literature

Prior work on test bias does not seem to have dealt directly with these item tryout and
selection procedures. In fact, as far as achievement tests are concerned, very little work of
any sort on the matter of bias appears to be available. The work on bias in intelligence
and aptitude tests is more extensive, but aspects of the bias issue other than the one
considered here have dominated discussions.

That children's intelligence test scores are related to their social and economic status was
reported by Binet and others more than 60 years ago and has been studied and argued
about ever since. For a long time, these debates largely stayed within the bounds of the
much older and highly emotional nature-nurture controversy, perhaps because many felt
that the then new tests could settle the issue (Terman 1916, pp. 19-20). Since the
[illegible] of the arguments shows no [illegible] (for example, the response to Jensen
1969), that hope may be considered unreasonable. In any case, the test score differences
favoring the more privileged elements of society remain a fact (Coleman et al. 1966). It
may be added that the accusations of the misuse and the misinterpretation of scores
(Hunter & Rogers 1967; Mercer 1971) are also factual in some, if not most, instances.

However, the issue here is the nature of the tests themselves. This has not been as widely
studied as it might be. Apparently, the first serious attempt to examine test items for bias
was led by Allison Davis and his colleagues 20 years ago (Eells et al. 1951). They
examined several existing group intelligence tests and the items in them in an attempt to
determine the factors built into the tests which are related to differences in performance
between culture groups. They concluded: "Variations in opportunity for familiar
cultural words, objects, or processes required for answering the test items seem . . . the
most adequate general explanation . . ." (Eells 1951, p. 68). This sort of objection is also
often made to achievement tests (Wasserman 1969) but is not a valid basis for asserting
bias in an achievement test, unless the missing knowledge is irrelevant to what is being
measured. Consider the finding reported by Chang and Raths (1971) that achievement
test items which discriminate between middle- and lower-class groups reflect a different
curricular emphasis on the part of the teachers. This is more nearly teacher bias than test
bias. In an ability test, such objections have direct logical merit.

Interestingly, the subjects in the Eells study were all white and drawn from the schools of
"a western industrial city of about 100,000 people." One result of the study was the
publication of the Davis-Eells Games (1953), which was designed to eliminate this kind of
cultural bias. Three things may be noted about this test, which is now out of print. First,
the test proved to yield differences between middle and low socioeconomic status (SES)
groups (Angelino & Shedd 1955) as substantial as those found using other group
intelligence tests. Second, Davis and Eells eliminated the items that showed SES
differences in difficulty only if they could rationalize the difference as a consequence of
opportunity. Lastly, they apparently did not look at the differences between SES groups
with respect to item discrimination. The common interpretation of the outcome of the
Davis-Eells test and similar efforts by others has been that the task of building a
"culture-free" or "culture-fair" test may be not only impossible but inappropriate
because such a test would not be valid as a measure of general ability, as, indeed, was the
case for the Davis-Eells Games (Lorge 1966).



Supporting this view is work such as that of Lesser, Fifer, and Clark (1965). This study
showed that patterns of ability are different for different ethnic groups. It also showed
that within any one ethnic group, quantitative differences resulted from socioeconomic
status, but the patterns for SES groups were very similar. That is, the lower-class and
middle-class groups of any one ethnic group had similar patterns, but the latter had higher
scores. Such data imply that any test measuring several abilities (as most ability and
achievement tests do) is automatically stacking the cards against one ethnic group or
another.

Furthermore, Williams (1970) reports that he has built a test biased in favor of blacks. His
validation studies of the instrument as a measure of academic aptitude are not yet
complete, but if Williams can produce a valid ability test favoring blacks, then it is
probable that most ability tests are biased. In the meantime, many people are taking this
to be established fact, and assertions that group intelligence tests necessarily discriminate
against various minority and disadvantaged groups in our society have been increasing in
number and vehemence. Some school systems (New York City, for example) have
virtually abandoned the use of such tests (Gilbert 1966). Similarly, some college
personnel now argue that the various placement and ability tests traditionally used are
inappropriate (Brown & Russell 1964).

Many of the assertions made about bias in ability tests appear to be sound, but, as
Anastasi (1968) has pointed out, bias in prediction involves a distinct set of issues. None
of the preceding considerations necessarily applies if the test in question is meant to be
used as a predictor of some criterion performance. For example, if one defines bias as
systematic under-prediction, then the attacks on the aptitude tests used for college
admissions appear largely unfounded. The claim that such tests fail to function among
disadvantaged minority students in the way they do in other groups lacks supporting
evidence. A series of studies at both the high school and college levels shows that
academic aptitude tests frequently predict grades just as well for minority groups as they
do for more privileged groups. Only the work of Green and Farquhar (1965) points to
a different conclusion among a half dozen or so studies on this issue. In fact, some
tests appear to over-predict the performance of lower-class and Negro students in
contrast to middle-class and white students (Hewer 1965; Stanley & Porter 1967;
Cleary 1968; Davis & Temp 1971).



Even in this relatively well explored area, much remains to be done, such as finding ways
to deal with the possibility of bias in the criterion measure (Linn & Werts 1971). In
addition, there is often more than one reasonable definition of bias in criterion-related
validity situations (Thorndike 1971; Darlington 1971). As Potthoff (1966) has pointed
out, the operational demonstration of bias is even more difficult and ambiguous when
test validity cannot be defined as the relationship of scores to a directly measurable
criterion. Any test yielding scores meant to be an indication of status (be it in
achievement, in intelligence, or in what have you) creates such problems.

One approach consistent with the definition of bias offered at the start of this paper is to
examine the items, rather than the whole test, for bias. Here, bias may be defined as an
item by group interaction. Three studies (Cardall & Coffman 1964; Cleary & Hilton 1968;
Angoff & Ford 1971) using this approach have been reported.

They each found statistically significant item by race interactions in the College Entrance
Examination Board aptitude tests which they used (SAT and PSAT). Nevertheless, Cleary
and Hilton concluded that "the PSAT is not biased for practical purposes," while Angoff
and Ford suggested the "interaction was simply the difference in performance levels on
the test shown by the two races." These studies were based largely upon a consideration
of item difficulties.

Item interrelationships are also a relevant consideration. Data obtained by Kennedy et al.
(1963) show that the grandfather of them all, the Stanford-Binet (Terman & Merrill
1960), produced equal or higher item-test correlations for an all black southern sample
than were reported in either the 1937 or the 1960 standardization. Also, Merz (1970) has
reported that the factor structure of the Goodenough-Harris Drawing Test is substantially
the same for samples of black, white, Mexican, and Anglo children in the southwest.

Incomplete as this research on bias in ability tests may be, it is way ahead of that on bias
in achievement tests, which is essentially nonexistent. The claims of bias in achievement
tests (Wasserman 1969; Williams 1970; Houston 197[?]) need investigation, and the study
of item by group interaction seems to be the logical place to begin. Certainly it seems
reasonable to believe that a test based on items selected for a particular group (such as
inner city black children) would be less biased against them and therefore more useful for
them.



Objectives of the Study



To explore such a possibility, this study compares the results of using three disadvantaged
minority groups (northern, urban black; southern, rural black; and southwestern
Mexican-American) as tryout samples in contrast to white, advantaged groups in the
same regions.

The study attempts to determine whether or not an item tryout using these different
groups would lead to the selection of different items from the item pool and, if so:

(1) Do the different items selected measure different things?

(2) Are the resulting item sets "better" for the minority groups in the sense that they
are more reliable and have better functioning items (higher point biserial
correlations)?

(3) Will the relative discrepancy in scores favoring majority groups be reduced by using
a minority tryout group?



Limitations of the Study

The major limitation of this study is the restricted nature of the item pool: all items
come from an already published test. They are therefore preselected and may be limited
in their possibility of eliciting differential reactions from the sample groups. Also, it
should be noted that grade and test level are not independent; the test levels were
designed to be continuous and to articulate well, but they are different tests. Thus, the
assumption made throughout the following material that grade differences are meaningful
may not be justified. Finally, because of limitations of time and money, not all relevant
analyses of the data could be made.



METHOD

The basic data for this study were derived from those obtained during the standardization
of the California Achievement Tests, 1970 Edition (CAT-70), published by CTB/
McGraw-Hill. The CAT-70 is a general achievement battery with five overlapping levels. It
was designed to measure educational attainment and to provide an analysis of learning
difficulties. It is basically similar to the 1957 edition and generally measures:

(1) the ability to understand the meaning of the content material presented;

(2) the performance of the student in applying rules, facts, concepts, conventions, and
principles to solve problems in the basic curricular material; and

(3) the level of performance of the student in using the tools of reading, mathematics,
and language in progressively more complicated situations.

The tests in the battery which were investigated in this study are Reading Vocabulary,
Reading Comprehension, Total Reading, Mathematics Computation, Mathematics
Concepts and Problems, Total Mathematics, Language Mechanics, Language Usage and
Structure, and Total Language. Total Reading, Total Mathematics, and Total Language
were treated as tests separate from their parts. The standardization took place early in
1970 and involved over 200,000 students in about 400 schools. The sampling design
called for obtaining a sample of school districts stratified by region (seven areas), school
district size (three categories by average enrollment per grade), community type (urban,
town, rural, other), and control (public or parochial). Within the districts, schools were
chosen randomly for each test level, and all students in the selected schools who were in
appropriate grades took the test.

The items in the battery came from a variety of sources, but it is fair to say that they
were written by and for "middle America." The tryout samples also fit this description.
Thus, the tests should favor white, middle-class Americans if they favor any group.



Sample 

All schools participating in the CAT-70 standardization answered questionnaires which
provided information on the basic character of the area served (e.g., residential suburb,
inner part of a large city, etc.), the percentage of white students, the percentage of
children from homes where another language is spoken, and the percentage of children in
families falling in each of four SES groups defined by parental occupation (professional-
managerial, white collar, skilled, unskilled).

From the data on these questionnaires, seven groups of schools were drawn for this study.
The characteristics and sizes of these groups are shown in Table 1. The samples used in
this study are drawn from schools serving pupils highly homogeneous with respect to
ethnic background and rather homogeneous with respect to socioeconomic status. Only
at Grade 10 was it not always possible to find schools meeting these criteria in the
standardization population; sufficiently segregated tenth grades were found only in the
South.

The groups were paired for comparisons as follows:

(1) Northern, black, central city versus northern, white, suburban (II vs. I)

(2) Southern, black, rural versus southern, white, suburban (IV vs. III)

(3) Southern, black, rural versus southern, white, rural (IV vs. V)

(4) Southwestern, Mexican-American versus southwestern, Anglo-American, suburban
(VI vs. VII)



Table 1

CHARACTERISTICS OF THE SAMPLE GROUPS

Group   Geographic  Residential             Ethnic                       Socioeconomic   Number of Cases by Grade
Number  Region(a)   Type                    Group(b)                     Status(b)         1     3     5     8    10

I       North       Residential Suburban    White                        High (81%)      299   225   265   328     -
II      North       Central City            Black (99%)                  Low (81%)       285   304   278   250     -
III     South       Residential Suburban    White (99%)                  High (77%)      361   211   293   304   279
IV      South       Rural                   Black (100%)                 Low (96%)       202   220   171   245   193
V       South       Rural                   White (91%)                  Low (81%)       323   200   199   296   246
VI      Southwest   Small and Large Cities  Mexican-American (87%)(c)    Low (82%)       146   144   169   399     -
VII     Southwest   City and Suburban       Anglo-American (99%)         High (81%)      189   218   249   277     -

(a) The states containing these particular school systems are: North: Illinois, Indiana, Kansas,
New Jersey; South: Alabama, Georgia, South Carolina; Southwest: Arizona, Oklahoma, Texas.

(b) Estimated per cent of cases falling in the category.

(c) 81% speak mostly Spanish at home.



Enough schools meeting the appropriate criteria to provide between 150 and 300
students for each group at each of five grade levels were selected. Each of the grade levels
(1, 3, 5, 8 and 10) corresponds to a different level of the CAT-70 battery.

Grade 10 comparisons were made in the South only. No analyses were made of the Total
Language scores in Grades 1, 5, and 8 for the northern, white group and in
Grades 1 and 8 for the northern, black group. Therefore, of the 315 possible
analyses (7 groups x 9 tests x 5 grades), only 274 separate analyses were made.
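The count of analyses can be checked with a few lines of arithmetic. The sketch below assumes that Grade 10 analyses were omitted for the four groups outside the South (I, II, VI, and VII, the groups with no Grade 10 cases in Table 1) and that the Total Language omissions are the five group-by-grade combinations named in the text; the function name is hypothetical.

```python
def count_analyses():
    """Return (possible, made) counts of group x test x grade analyses."""
    n_groups, n_tests, n_grades = 7, 9, 5
    possible = n_groups * n_tests * n_grades

    # Assumption: no Grade 10 data for groups I, II, VI, and VII.
    missing_grade_10 = 4 * n_tests

    # Total Language not analyzed: group I in Grades 1, 5, 8; group II in Grades 1, 8.
    missing_total_language = 3 + 2

    made = possible - missing_grade_10 - missing_total_language
    return possible, made


possible, made = count_analyses()
print(possible, made)  # 315 274
```

Under these assumptions the arithmetic reproduces both figures given in the text.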

Data Analyses

The basic procedure used for examining the data was an item selection routine. Each of
the seven groups was treated as a tryout sample with the items in each test functioning as
an item pool. For each group on each test at each grade, the "best" half of the
items (i.e., those with the highest item-test correlations) were noted. Four kinds of
analyses were made:

(1) The number and per cent of items chosen for one group in the pair but not for the
other were recorded. These items were labeled "biased." The number of these
"biased" items in any one comparison indicates the degree to which the two groups
interact in a distinct manner with the test items. All 21 possible pairs of groups
were compared in this way; the remaining analyses were made only for the four
pairs listed previously.

(2) Scores for each group in a pair were obtained on both sets of biased items. These
two tests may be called the "majority biased test" and the "minority biased test"
since they contain the items uniquely best for the respective groups. The
correlation between each group's scores on the two tests was found. From these
correlations, estimates of the variance not common to the two biased item tests
were made to judge how different the sets of items really are in what they
measure. Thus, this analysis supplements the first.

(3) Another analysis consisted of examining and comparing full-test and half-test
KR 20 reliability estimates, since differential reliability would be a form of bias
indicating that the test scores have a larger error component in one group than
they do in another group.

(4) Finally, mean scores on the full-test, the half-test, and the biased item tests were
examined for changes in relative status of the groups as a result of item selection.
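The reliability estimate named in the third analysis can be sketched as follows. This is the standard Kuder-Richardson formula 20 for 0/1 items, written here with the population variance of total scores; the report does not state which variance convention was used, so that choice and the sample data are assumptions.

```python
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson formula 20 for a matrix of 0/1 item scores.

    responses: one list of 0/1 item scores per examinee.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    total_var = pvariance(totals)  # assumption: population (N) variance
    if n_items < 2 or total_var == 0:
        raise ValueError("KR 20 needs at least two items and some score variance")

    # Sum of item variances p * q, where p is the proportion passing the item.
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_examinees
        pq_sum += p * (1 - p)

    return n_items / (n_items - 1) * (1 - pq_sum / total_var)


# Hypothetical group: four examinees, three items.
group = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
reliability = kr20(group)
```

Computing the coefficient separately for each group in a pair, on the same items, gives the kind of comparison the analysis describes: a clearly lower value in one group would suggest a larger error component in that group's scores.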
RESULTS

Proportions of Biased Items

The item selection routine yielded a series of tests "best" for each group, half as long as
the original test. When N was odd, the expression (N + 1)/2 was used to determine the
length of the half-test. The next step was to identify those items selected for only one of
the two members of a pair, the so-called biased items. Obviously, the number of biased
items has to be the same for each group in a pair. This number as a proportion of the
items in each half-test is an index of the degree to which the item selection procedure
produces a different test for the two groups.
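The index just described can be sketched in a few lines. The sketch assumes the item-test correlations for the same items are already available for each group, and it breaks ties by item order, a detail the report does not specify; the function names and the correlations are hypothetical.

```python
def half_length(n_items):
    """Length of the half-test; (N + 1) / 2 when N is odd."""
    return (n_items + 1) // 2


def best_half(correlations):
    """Indices of the items with the highest item-test correlations."""
    order = sorted(range(len(correlations)),
                   key=lambda j: correlations[j], reverse=True)
    return set(order[:half_length(len(correlations))])


def biased_items(corr_group_a, corr_group_b):
    """Items selected for only one group, plus the proportion of biased items."""
    half_a = best_half(corr_group_a)
    half_b = best_half(corr_group_b)
    only_a = sorted(half_a - half_b)
    only_b = sorted(half_b - half_a)
    proportion = len(only_a) / half_length(len(corr_group_a))
    return only_a, only_b, proportion


# Hypothetical item-test correlations for a five-item test in two groups.
only_a, only_b, proportion = biased_items([.6, .5, .4, .3, .2],
                                          [.2, .5, .6, .1, .4])
```

Because both half-tests have the same length, the two sets of unique items always have the same size, which is why a single proportion describes each pair.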



Table 2 exhibits these proportions for the four basic comparison groups. The proportions
do not appear to vary systematically by grade or test. However, certain groups appear
considerably more like each other than are others by the criterion of the relative size of
these proportions. It can be readily seen from Table 2 that the differences between the
Mexican-American and Anglo groups tend to be larger than those between the black and
white pairs.

Table 2

PROPORTIONS OF BIASED ITEMS FOR COMPARISON GROUPS
BY GRADE AND TEST

                        Number of                  Comparison Groups
Test                    Items Selected   II vs. I   IV vs. III   IV vs. V   VI vs. VII

Grade 1
Vocabulary                    46           .41         .35         .35        .59
Comprehension                 12           .25         .58         .33        .42
Total Reading                 58           .40         .36         .34        .69
Computation                   20           .15         .25         .40        .25
Concepts & Problems           24           .42         .38         .42        .58
Total Mathematics             44           .16         .25         .23        .41
Mechanics                     19           .42         .21         .21        .58
Usage & Structure             10           .30         .30         .40        .40
Total Language                39            -          .24         .27        .54

Grade 3
Vocabulary                    20           .30         .65         .35        .45
Comprehension                 23           .22         .26         .22        .35
Total Reading                 43           .28         .42         .28        .33
Computation                   36           .17         .28         .22        .25
Concepts & Problems           23           .35         .48         .35        .43
Total Mathematics             59           .29         .32         .30        .32
Mechanics                     33           .48         .42         .30        .45
Usage & Structure             13           .31         .46         .23        .46
Total Language                46           .41         .30         .28        .48

Grade 5
Vocabulary                    20           .50         .55         .35     [illegible]
Comprehension                 21           .48         .43         .29        .52
Total Reading                 41           .46         .46         .37        .61
Computation                   34           .41         .38         .21        .41
Concepts & Problems           20           .50         .40         .20        .55
Total Mathematics             54           .44         .46         .20        .46
Mechanics                     40           .45         .35         .25        .53
Usage & Structure             21           .33         .48         .38        .33
Total Language                61            -          .30         .16        .26

Grade 8
Vocabulary                    20           .40         .15         .15        .45
Comprehension                 23           .22         .39         .30        .39
Total Reading                 43           .26         .23         .21        .44
Computation                   24           .25         .46         .29        .29
Concepts & Problems           25           .36         .40         .36        .28
Total Mathematics             49           .29         .49         .35        .29
Mechanics                     36           .42         .33         .42        .39
Usage & Structure             25           .36         .56         .32        .16
Total Language                61            -          .15      [illegible]   .18

Grade 10
Vocabulary                    20            -          .53         .40         -
Comprehension                 23            -          .22         .22         -
Total Reading                 43            -          .42         .30         -
Computation                   24            -          .33         .33         -
Concepts & Problems           25            -          .40         .32         -
Total Mathematics             49            -          .33         .24         -
Mechanics                     40            -          .38         .35         -
Usage & Structure             27            -          .41         .30         -
Total Language                67            -          .21         .19         -

Median proportions
for all tests and all grades               .36         .38         .30        .43


The medians of these proportions for all possible pairs are shown in Table 3. The overall
median proportion is approximately .30. As expected, the white, middle-class groups are
consistently more like each other (these pairs have lower medians) than they are like the
minority groups. The latter also have more in common than they share with the three
majority groups. The southern, rural, white group does not fully fit into this otherwise
clear pattern; in general, they appear more like the three minority groups than they
resemble the three suburban groups. Of course, economically they are undoubtedly more
disadvantaged than the suburban groups, albeit much less so than the southern, black
group.



Table 3

MEDIAN PROPORTIONS OF BIASED ITEMS
FOR EACH PAIR OF GROUPS

Group      I     II    III    IV     V     VI    VII

I          -    .36    .26   .35   .30    .38    .26
II        .36    -     .33   .26   .25    .25    .41
III       .26   .33     -    .38   .30    .33    .27
IV        .35   .26    .38    -    .30    .30    .41
V         .30   .25    .30   .30    -     .24    .33
VI        .38   .25    .33   .30   .24     -     .43
VII       .26   .41    .27   .41   .33    .43     -



Independence of Biased Item Tests

All groups differ from their pairs to some degree by the criterion of proportion of biased
items, and some of the differences appear to be substantial. However, it is possible that
these sets of biased items still measure much the same thing. To examine this possibility,
scores for each individual were obtained on both biased item tests. This was possible since
each individual answered all items. The correlations between these two scores were
obtained for each group on each test. These correlations varied from -.17 to +.82 with a
median of about .5, which leaves a lot of variance unaccounted for. Since the number of
biased items was very small in many cases, the reliabilities of the biased tests are typically
low. But even allowing for this, it appears that in many instances the majority and
minority tests measure quite different things and as a rule do so for both groups involved.
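The phrase "a lot of variance unaccounted for" can be made concrete with a short sketch. It computes a Pearson correlation between scores on the two biased item tests and the proportion of variance the two scores do not share, 1 - r squared. No correction for the low reliability of the short tests is applied, so this is only the uncorrected figure; the function names and sample scores are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a score has no variance")
    return sxy / sqrt(sxx * syy)


def unshared_variance(r):
    """Proportion of variance not common to the two scores."""
    return 1 - r * r


# With the median correlation of about .5 reported in the text:
print(unshared_variance(0.5))  # 0.75

# Hypothetical scores of five pupils on the majority and minority biased tests.
r = pearson_r([3, 5, 2, 6, 4], [2, 3, 3, 5, 2])
```

At a correlation of .5, three quarters of the variance in one biased item test is not accounted for by the other, before any allowance for unreliability.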



Changes in Test Characteristics

A special case of bias occurs if the test scores for one group contain substantially more
error than they do for another group. The overall median KR 20's on the full tests for
groups I through VII are .91, .91, .91, .92, .93, .90, and .92, respectively. Obviously,
there is little evidence of bias by this criterion, although a test-by-test comparison of
these reliabilities shows that the figures are mostly higher for the majority group (97 of
162 comparisons). The data concerning half-test reliabilities also show a very small
amount of bias.

The item-test correlations after item selection show only slight improvements. The
uniformity of the increases prevents one from inferring the presence of substantial bias.



Changes in Test Scores 

Another way to look at bias is to assert that the scores of some groups are unfairly low
because the test does not adequately measure all the relevant abilities or knowledge and,
in particular, does not measure well those relevant attributes on which the group in
question happens to score well. If the item pool contains items which measure these
attributes at all, a selection routine using this group might be expected to increase the
importance of these attributes in determining the total score, thereby reducing the
disadvantage of the group. Therefore, the three minority groups considered here might be
expected to do relatively better on the items selected as best for them than they did on
the original full test. Each group's full- to half-test improvement on each of the nine tests
in the battery was compared to the improvement shown by its comparison group. Table 4
reports the number of tests on which a group showed more full- to half-test improvement
than was shown by its comparison group. The minority groups showed greater relative
improvement consistently in the upper grades, but not in Grades 1 and 3. As was the case
for proportions of biased items, the southern, rural, white group does not fit the
pattern; the item selection procedure helped them as often as it helped the rural blacks,
perhaps because their initial scores were more alike to begin with, especially in the lower
grades.






Table 4

NUMBER OF TESTS ON WHICH EACH GROUP
SHOWED MORE FULL- TO HALF-TEST
MEAN SCORE GAIN THAN ITS COMPARISON GROUP(a)

                            Comparison Groups
           II & I       IV & III     IV & V       VI & VII      Totals
Grade     Min. Maj.    Min. Maj.    Min. Maj.    Min. Maj.    Min. Maj.     χ²      p

   1       7    1(b)    1    8       0    9       7    2       15   20      0.7     NS
   3       2    7       8    1       5    4       4    5       19   17      0.1     NS
   5       7    1(b)    8    1       1    8       8    1       24   11      4.8    .05
   8       8    0(b)    6    3       7    2       6    3       27    8     10.3    .01
  10       -    -       6    3       7    2       -    -       13    5      3.6    .10

Totals    24    9      29   16      20   25      25   11       98   61      8.6    .01
χ²          6.8          3.8          0.6          5.4
p           .01          .05          NS           .02



(a) Let Y = majority group mean, X = minority group mean, and let f and h
represent full-test and half-test, respectively. Then $(Y_f - X_f) - 2(Y_h - X_h) > 0$
favors minority; $(Y_f - X_f) - 2(Y_h - X_h) < 0$ favors majority.

(b) Note that analyses were not made for the Total Language of the CAT-70 for
this group at this grade. Therefore, comparisons were made for only eight tests.
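
The decision rule in the Table 4 footnote, and the statistic reported beside the tallies, can be sketched as follows. The rule compares the full-test gap with twice the half-test gap, since a half-test has half as many items. The report does not state which test produced the tabled statistics; the `even_split_chi_square` function below assumes a one-degree-of-freedom chi-square against an even split, (a - b)^2 / (a + b), which reproduces the grand-total value of 8.6 for the 98 versus 61 tally. The function names and the example means are hypothetical.

```python
def gain_favors(y_full, x_full, y_half, x_half):
    """Apply the Table 4 footnote rule (y = majority mean, x = minority mean)."""
    d = (y_full - x_full) - 2 * (y_half - x_half)
    if d > 0:
        return "minority"
    if d < 0:
        return "majority"
    return "tie"


def even_split_chi_square(n_minority, n_majority):
    """Chi-square (1 df) for a two-way tally against a 50/50 expectation.

    Assumed form of the test; the report does not name the statistic.
    """
    total = n_minority + n_majority
    return (n_minority - n_majority) ** 2 / total


# Hypothetical means: full-test gap of 6.0, half-test gap of 2.5.
print(gain_favors(30.0, 24.0, 16.0, 13.5))        # minority
# Grand totals from Table 4: 98 tests favor minority, 61 favor majority.
print(round(even_split_chi_square(98, 61), 1))    # 8.6
```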




The majority biased item tests are almost uniformly more difficult for both groups than
are the minority biased item tests. In addition, the differences between majority group
mean scores and minority group mean scores are usually smaller on the minority biased
item tests than on the majority biased item tests. Table 5 shows the frequencies of this
phenomenon. The biased tests are clearly biased in favor of the group used as the basis for
selection, and this result tends to hold for all groups at all grades. The disadvantaged group
is less disadvantaged when tested with items selected as uniquely best for them. In other
words, the data show that the relative advantage of majority groups is reduced when using
items chosen as best for the minority group but is increased when using items chosen as
best for themselves.



Table 5

NUMBER OF COMPARISONS IN WHICH MEAN DIFFERENCE
ON BIASED ITEM TESTS
FAVORS EACH GROUP(a)

                            Comparison Groups
           II & I       IV & III     IV & V       VI & VII      Totals
Grade     Min. Maj.    Min. Maj.    Min. Maj.    Min. Maj.    Min. Maj.     χ²      p

   1       5    3(b)    ?    ?       ?    ?       8    1       27    8     10.3    .01
   3       5    4       ?    ?       ?    ?       7    2       20   16      0.4     NS
   5       7    1(b)    5    4       7    2       7    2       26    9      8.3    .01
   8       8    0(b)    9    0       6    3       5    4       28    7     12.6   .001
  10       -    -       ?    ?       ?    ?       -    -       11    7      0.9     NS

Totals    25    8      31   14      29   16      27    9      112   47     26.6   .001
χ²          8.8          6.4          3.8          9.0
p           .01          .02          .05          .01

? = entry illegible.



(a) Let $Y_m$ = majority mean on majority test, $X_m$ = minority mean on majority test,
$Y_n$ = majority mean on minority test, and $X_n$ = minority mean on minority test.
Then $Y_m - X_m > (Y_n - X_n)$ favors minority; $Y_m - X_m < (Y_n - X_n)$ favors
majority.

(b) Note that analyses were not made for the Total Language of the CAT-70 for this
group at this grade. Therefore, comparisons were made for only eight tests.
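
The Table 5 footnote rule amounts to comparing two gaps: the majority-minus-minority difference on the items selected for the majority group, and the same difference on the items selected for the minority group. A minimal sketch of that comparison and of the tallying step follows. The function name, argument names, and the three sets of means are hypothetical and are not data from the study.

```python
from collections import Counter


def biased_test_favors(y_maj_items, x_maj_items, y_min_items, x_min_items):
    """y = majority group mean, x = minority group mean, on each biased item test."""
    gap_on_majority_items = y_maj_items - x_maj_items
    gap_on_minority_items = y_min_items - x_min_items
    if gap_on_majority_items > gap_on_minority_items:
        return "minority"   # the gap shrinks on the minority-selected items
    if gap_on_majority_items < gap_on_minority_items:
        return "majority"
    return "tie"


# Hypothetical means for three test comparisons (not from the study).
comparisons = [
    (12.0, 9.0, 14.0, 12.5),   # gaps 3.0 vs 1.5
    (10.0, 8.0, 11.0, 8.5),    # gaps 2.0 vs 2.5
    (9.0, 7.5, 12.0, 11.0),    # gaps 1.5 vs 1.0
]
tally = Counter(biased_test_favors(*c) for c in comparisons)
print(tally["minority"], tally["majority"])
```

Each cell of the table is a tally of this kind over the tests available for one pair of groups at one grade.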



CONCLUSIONS



The four analyses of the data described previously permit the following conclusions:

(1) Different tryout samples lead to the selection of somewhat different sets of
    items. Considering the restriction on range and variety of points of view
    represented in the item pool, the 30% proportion of biased items, which was the
    average found in this study, seems large. That is, it seems likely that a majority of
    biased items would have been selected if the item pool had been more
    heterogeneous.

(2) The more economically dissimilar the groups contrasted, the less likely it is that
    they will produce data leading to the selection of the same set of items.

(3) If a biased test is a test that contains a substantial proportion of items that would
    not have been selected had they been tried on some other particular group, then
    probably most tests are biased against most groups.

(4) By this criterion of bias, the tests used here are more biased against minority
    groups than against middle-class, white children. This is probably true for most
    published batteries of standardized tests.

(5) The proportion of biased items is a fairly good, but uneven, criterion of bias since
    in most cases the biased item tests do measure different things. What is measured
    depends on which group is used for selection and which group is being tested.
    This conclusion is not uniformly true and varies widely according to test, grade,
    and tryout group.

(6) The psychometric quality of the half-tests was only very slightly better than that
    of the originals. That is, the effect of the item selection procedure was small,
    presumably because all the items were already a product of an item selection
    procedure and because the battery is rather homogeneous in style and point of
    view.

(7) The half-tests were barely more reliable for the minority groups than for the
    majority groups, but this improvement is small in both kinds of groups and
    suggests minimal bias of this sort in the battery.

(8) The use of items particularly suited to a tryout group will improve the chances of
    good scores among individuals from similar groups. This outcome may be more
    likely in the upper grades.

(9) The amount of relative improvement in score that a minority group could expect
    to gain by using tests built with tryout groups like itself does not appear to be
    very large. This relative improvement is most unlikely to overcome any large
    discrepancy between typical scores in that group and those in more favored
    groups.

(10) It should be possible to build tests somewhat biased in favor of any group by
    using a fair sample of that group for item selection data.
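
The selection procedure behind these conclusions (keep the half of the items with the highest item-test correlations for a tryout group, then count how many of those items would not have been kept for another group) can be sketched as below. This is an illustration under stated assumptions, not the study's actual routine: the function names, the use of an uncorrected Pearson item-total correlation, and the contrived response matrices are all hypothetical.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0


def best_half(responses):
    """Indices of the half of the items with the highest item-test correlation."""
    totals = [sum(row) for row in responses]   # total includes the item itself
    n_items = len(responses[0])
    r = [pearson([row[j] for row in responses], totals) for j in range(n_items)]
    ranked = sorted(range(n_items), key=lambda j: r[j], reverse=True)
    return set(ranked[: n_items // 2])


def proportion_biased(selected_a, selected_b):
    """Share of group A's selected items that were not selected for group B."""
    return len(selected_a - selected_b) / len(selected_a)


# Contrived 0/1 responses (rows = examinees, columns = items); not study data.
group_a = [[1, 1, 0, 1], [1, 1, 1, 0], [0, 0, 0, 1], [0, 0, 1, 0], [1, 1, 1, 1]]
group_b = [row[::-1] for row in group_a]   # same pattern with item order reversed

sel_a, sel_b = best_half(group_a), best_half(group_b)
print(sorted(sel_a), sorted(sel_b), proportion_biased(sel_a, sel_b))
```

The toy groups are built so that no selected item is shared, which is far more extreme than the roughly 30% figure reported in conclusion (1); the point is only to show how the proportion is formed.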



RECOMMENDATIONS AND QUESTIONS

The conclusions strongly suggest that there should be some changes and additions to the
test construction procedures commonly used whenever there is a possibility that the
resulting instrument will be used with people belonging to a group ethnically or culturally
different from the test builder's principal reference group. Clearly, the first additional
step is to obtain data on all relevant groups separately. It is important to note that if a set
of items is likely to measure different attributes in different groups, the majority group in
a tryout sample will determine which attributes are most strongly measured and the odds
are that the inclusion of one or more minorities will merely obscure the issue. Just as the
degree of minority representation in standardization samples can have only a small
influence on norms, minority group presence in tryout samples dominated by some solid
majority will not accomplish much.

What is needed is a way either to (1) select unbiased items, (2) compensate for known
bias by establishing alternate weighting and scoring schemes, (3) interpret scores
according to the group membership of the examinee, or at least (4) acknowledge and
document the existence of the bias and its effect on scores. Until more experience is
available in using various kinds of separate tryout groups, it is not reasonable to state a
preference among these options; a number of questions need to be answered first, such
as:

(1) What proportion of items tried can one expect to find "unbiased" by each of
    various criteria?

(2) Can one expect simple scoring and weighting schemes to reduce bias?

(3) Are the same criterion measures appropriate for all groups?

(4) What sort of indices of bias could one offer that would be readily interpretable?

If the only favorable procedure turns out to be the last option, a test constructor could
choose to build alternate versions, each biased toward a different group; the problems
created by adopting this procedure are large and many but not necessarily insoluble.

In addition to exploration of the effects of variations in tryout groups, studies are needed
on the role of points of view, cognitive style, and/or ethnic background among those
contributing to the item pool. Would blacks tend to create items more useful for black
children? Many blacks believe so. It seems obvious that Spanish-speaking item writers can
produce better items for Spanish-speaking children than could someone who could not
write in that language. Yet, we still often use English language tests with children whose
native language is not English and claim to be measuring something other than facility
with English. It is, of course, less obvious if the children are fully bilingual. Are black
children bilingual?



The answers to these and many other questions one might raise are not obvious. What is
obvious is that it is no longer adequate for those who build tests to argue that bias is
largely a matter of misuse or to say that they cannot see why a particular test would be
biased and thus ignore the matter. All tests are not necessarily biased, but any test may
be. Until there are good answers to these questions, research on the matter should be a
standard part of producing a test.